Posts with tag Changes in language over time
Following up on my previous topic modeling post, I want to talk about the thing humanists most often do with topic models once they build them: chart the topics over time. Topic modeling can be very useful, but I think there’s too little skepticism about the technique, so I’m venturing to provide some (even with, I’m sure, a gross misunderstanding or two). More generally, the sort of mistakes temporal changes cause should call into question the complacency with which humanists tend to treat ‘topics’ in topic modeling as stable abstractions, and argue for much greater attention to the granular words that make up a topic model.
Before end-of-semester madness, I was looking at how shifts in vocabulary usage occur. In many cases, I found, vocabulary change doesn’t happen evenly across all authors. Instead, it can happen generationally: older people tend to use words at the rate that was common in their youth, and younger people anticipate future word patterns. An eighty-year-old in 1880 uses a word like “outside” more like a 40-year-old in 1840 than he does like a 40-year-old in 1880. The original post has a more detailed explanation.
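To make the generational claim concrete, here is a minimal sketch of the comparison it implies: pool the same books once by publication decade and once by the author's birth decade, and see which grouping the usage rate tracks. The field names (`pub_year`, `birth_year`, `hits`, `words`) and the three toy records are assumptions invented for illustration, not the data or code behind the original post.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Book = Dict[str, int]

# Toy records, invented for illustration: "hits" is how often the word
# appears in the book, "words" is the total length of the book.
BOOKS: List[Book] = [
    {"pub_year": 1880, "birth_year": 1800, "hits": 2, "words": 10000},
    {"pub_year": 1880, "birth_year": 1840, "hits": 8, "words": 10000},
    {"pub_year": 1840, "birth_year": 1800, "hits": 2, "words": 10000},
]


def decade(year: int) -> int:
    """Round a year down to the start of its decade."""
    return year // 10 * 10


def rate_by(books: List[Book], key: Callable[[Book], int]) -> Dict[int, float]:
    """Pooled usage rate (hits per word) for each group chosen by `key`."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for book in books:
        group = key(book)
        hits[group] += book["hits"]
        totals[group] += book["words"]
    return {group: hits[group] / totals[group] for group in totals if totals[group]}


# Same books, two groupings: when the book appeared vs. when its author was born.
by_publication = rate_by(BOOKS, lambda b: decade(b["pub_year"]))
by_birth = rate_by(BOOKS, lambda b: decade(b["birth_year"]))

if __name__ == "__main__":
    print("by publication decade:", by_publication)
    print("by birth decade:", by_birth)
```

In the toy data, the author born in 1800 uses the word at the same rate in 1840 and in 1880, so the birth-decade grouping is the one that separates the high and low rates cleanly. With real data you would want many books per group before reading anything into the difference.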
I’m interested in the ways different words are tied together. That’s sort of the universal feature of this project, so figuring out ways to find those ties would be useful. I already looked at some ways of finding interesting words for “scientific method,” but that was in the context of the related words as an endpoint of the analysis. I want to be able to automatically generate linked words, as well. I’m going to think through this staying on “capitalist” as the word of the day. Fair warning: this post is a rambler.
This verges on unreflective datadumping, but because it’s easy and I think people might find it interesting, I’m going to drop in some of my own charts for total word use in 30,000 books by the largest American publishers, on the same terms for which the Times published Cohen’s charts of title word counts. I’ve tossed in a couple of extra words where it seems interesting, including some alternate word-forms that tell a story, using a Perl word-stemming algorithm I set up the other day that works fairly well. My charts run from 1830 (there just aren’t many American books from before, and even the data from the 1830s is a little screwy) to 1922 (the date that digital history ends; thank you, Sonny Bono). In some cases (that 1874 peak for science), the American and British trends are surprisingly close. Sometimes, they aren’t.
So I just looked at patterns of commemoration for a few famous anniversaries. This is, for some people, kind of interesting: how does the publishing industry focus in on certain figures to create news or resurgences of interest in them? I love the way we get excited about the Civil War sesquicentennial now, or the Darwin/Lincoln year last year.
I’m going to keep looking at the list of isms, because a) they’re fun; and b) the methods we use on them can be used on any group of words–for example, ones that we find are highly tied to evolution. So, let’s use them as a test case for one of the questions I started out with: how can we find similarities in the historical patterns of emergence and submergence of words?
Here’s a fun way of using this dataset to convey a lot of historical information. I took all 414 words that end in “ism” in my database, and plotted them by the year in which they peaked,* with the size proportional to their use at peak. I’m going to think about how to make it flashier, but it’s pretty interesting as it is. Sample below, and full chart after the break.
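For anyone who wants to try the same thing, the peak-year step can be sketched in a few lines. This is only an illustration under stated assumptions: the `RATES` table is invented toy data, and the function names `peak` and `peaks_for_suffix` are hypothetical, not the code that produced the chart.

```python
from typing import Dict, List, Tuple

# Toy usage rates by year (say, occurrences per million words).
# These numbers are invented for illustration only.
RATES: Dict[str, Dict[int, float]] = {
    "abolitionism": {1840: 1.0, 1850: 4.0, 1860: 6.5, 1870: 2.0},
    "darwinism": {1860: 0.5, 1870: 3.0, 1880: 5.0, 1890: 4.0},
    "organism": {1840: 2.0, 1850: 3.0, 1860: 5.0, 1870: 9.0},
    "capitalist": {1850: 1.0, 1890: 7.0},
}


def peak(series: Dict[int, float]) -> Tuple[int, float]:
    """Return (year, rate) for the year in which usage was highest."""
    year = max(series, key=series.get)
    return year, series[year]


def peaks_for_suffix(
    rates: Dict[str, Dict[int, float]], suffix: str = "ism"
) -> List[Tuple[str, int, float]]:
    """List (word, peak year, rate at peak) for words ending in `suffix`."""
    rows = []
    for word, series in rates.items():
        if word.endswith(suffix) and series:
            year, rate = peak(series)
            rows.append((word, year, rate))
    # Order by peak year (then alphabetically) so the list reads as a timeline.
    return sorted(rows, key=lambda row: (row[1], row[0]))


if __name__ == "__main__":
    for word, year, rate in peaks_for_suffix(RATES):
        print(f"{year}  {word}  (rate at peak: {rate})")
```

Each row gives a word's position on the time axis (the peak year) and a value to scale its size by (the rate at peak). A plain suffix match also sweeps in words like “organism”, which may or may not be what you want in a list of isms.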
I can’t resist making a few more comments on that technologies graph that I laid out. I’m going to add a few thousand more books to the counts overnight, so I won’t make any new charts until tomorrow, but look at this one again.
Let’s start with just some of the basic wordcount results. Dan Cohen posted some similar things for the Victorian period on his blog, and used the numbers mostly to test hypotheses about change over time. I can give you a lot more like that (I confirmed for someone, though not as neatly as he’d probably like, that ‘business’ became a much more prevalent word through the 19C). But as Cohen implies, such charts can be cooler than they are illuminating.